Supervised Learning - Foundations Project: ReCell¶

Problem Statement¶

Business Context¶

Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market would be worth \$52.7bn by 2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing one. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.

Objective¶

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.

Data Description¶

The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.

  • brand_name: Name of manufacturing brand
  • os: OS on which the device runs
  • screen_size: Size of the screen in cm
  • 4g: Whether 4G is available or not
  • 5g: Whether 5G is available or not
  • main_camera_mp: Resolution of the rear camera in megapixels
  • selfie_camera_mp: Resolution of the front camera in megapixels
  • int_memory: Amount of internal memory (ROM) in GB
  • ram: Amount of RAM in GB
  • battery: Energy capacity of the device battery in mAh
  • weight: Weight of the device in grams
  • release_year: Year when the device model was released
  • days_used: Number of days the used/refurbished device has been used
  • normalized_new_price: Normalized price of a new device of the same model in euros
  • normalized_used_price: Normalized price of the used/refurbished device in euros

Importing necessary libraries¶

In [1]:
#manipulating and analyzing data
import numpy as np 
import pandas as pd 

#data visualization
import seaborn as sns
import matplotlib.pyplot as plt

#splitting data into train and test 
from sklearn.model_selection import train_test_split

#making linear regression model
import statsmodels.api as sm

#checking performance of model
from sklearn.metrics import mean_absolute_error, mean_squared_error

Loading the dataset¶

In [2]:
#assigning devices dataframe to the data stored in the csv file
devices = pd.read_csv('used_device_data.csv')

Data Overview¶

In [3]:
#viewing first 5 rows of data
devices.head()
Out[3]:
brand_name os screen_size 4g 5g main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used normalized_used_price normalized_new_price
0 Honor Android 14.50 yes no 13.0 5.0 64.0 3.0 3020.0 146.0 2020 127 4.307572 4.715100
1 Honor Android 17.30 yes yes 13.0 16.0 128.0 8.0 4300.0 213.0 2020 325 5.162097 5.519018
2 Honor Android 16.69 yes yes 13.0 8.0 128.0 8.0 4200.0 213.0 2020 162 5.111084 5.884631
3 Honor Android 25.50 yes yes 13.0 8.0 64.0 6.0 7250.0 480.0 2020 345 5.135387 5.630961
4 Honor Android 15.32 yes no 13.0 8.0 64.0 3.0 5000.0 185.0 2020 293 4.389995 4.947837
In [4]:
#viewing last 5 rows of data
devices.tail()
Out[4]:
brand_name os screen_size 4g 5g main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used normalized_used_price normalized_new_price
3449 Asus Android 15.34 yes no NaN 8.0 64.0 6.0 5000.0 190.0 2019 232 4.492337 6.483872
3450 Asus Android 15.24 yes no 13.0 8.0 128.0 8.0 4000.0 200.0 2018 541 5.037732 6.251538
3451 Alcatel Android 15.80 yes no 13.0 5.0 32.0 3.0 4000.0 165.0 2020 201 4.357350 4.528829
3452 Alcatel Android 15.80 yes no 13.0 5.0 32.0 2.0 4000.0 160.0 2020 149 4.349762 4.624188
3453 Alcatel Android 12.83 yes no 13.0 5.0 16.0 2.0 4000.0 168.0 2020 176 4.132122 4.279994

Observations:¶

Each row in the dataset holds information for one device.

In [5]:
#viewing the number of rows and columns in the data
devices.shape
Out[5]:
(3454, 15)

Observations:¶

There are 3,454 rows and 15 columns in the devices dataframe.

In [6]:
#viewing information about each column and its datatype
devices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3454 entries, 0 to 3453
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   brand_name             3454 non-null   object 
 1   os                     3454 non-null   object 
 2   screen_size            3454 non-null   float64
 3   4g                     3454 non-null   object 
 4   5g                     3454 non-null   object 
 5   main_camera_mp         3275 non-null   float64
 6   selfie_camera_mp       3452 non-null   float64
 7   int_memory             3450 non-null   float64
 8   ram                    3450 non-null   float64
 9   battery                3448 non-null   float64
 10  weight                 3447 non-null   float64
 11  release_year           3454 non-null   int64  
 12  days_used              3454 non-null   int64  
 13  normalized_used_price  3454 non-null   float64
 14  normalized_new_price   3454 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 404.9+ KB

Observations:¶

  • brand_name, os, 4g, and 5g are object datatypes, and categorical variables.
  • screen_size, main_camera_mp, selfie_camera_mp, int_memory, ram, battery, weight, normalized_used_price, and normalized_new_price are all float datatypes, and numerical variables.
  • release_year and days_used are both integer datatypes, and days_used is a numerical variable.
  • Although release_year is an integer, it seems to function as a categorical variable.
  • There are missing values in main_camera_mp, selfie_camera_mp, int_memory, ram, battery, and weight because their non-null count is not equal to 3,454.
In [7]:
#viewing statistical summary of all of the columns in the devices dataframe
#using include='all' to make sure the summary also includes categorical variables
devices.describe(include='all')
Out[7]:
brand_name os screen_size 4g 5g main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used normalized_used_price normalized_new_price
count 3454 3454 3454.000000 3454 3454 3275.000000 3452.000000 3450.000000 3450.000000 3448.000000 3447.000000 3454.000000 3454.000000 3454.000000 3454.000000
unique 34 4 NaN 2 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top Others Android NaN yes no NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq 502 3214 NaN 2335 3302 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN 13.713115 NaN NaN 9.460208 6.554229 54.573099 4.036122 3133.402697 182.751871 2015.965258 674.869716 4.364712 5.233107
std NaN NaN 3.805280 NaN NaN 4.815461 6.970372 84.972371 1.365105 1299.682844 88.413228 2.298455 248.580166 0.588914 0.683637
min NaN NaN 5.080000 NaN NaN 0.080000 0.000000 0.010000 0.020000 500.000000 69.000000 2013.000000 91.000000 1.536867 2.901422
25% NaN NaN 12.700000 NaN NaN 5.000000 2.000000 16.000000 4.000000 2100.000000 142.000000 2014.000000 533.500000 4.033931 4.790342
50% NaN NaN 12.830000 NaN NaN 8.000000 5.000000 32.000000 4.000000 3000.000000 160.000000 2015.500000 690.500000 4.405133 5.245892
75% NaN NaN 15.340000 NaN NaN 13.000000 8.000000 64.000000 4.000000 4000.000000 185.000000 2018.000000 868.750000 4.755700 5.673718
max NaN NaN 30.710000 NaN NaN 48.000000 32.000000 1024.000000 12.000000 9720.000000 855.000000 2020.000000 1094.000000 6.619433 7.847841
In [8]:
#viewing the statistical summary for numerical columns only
devices.describe()
Out[8]:
screen_size main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used normalized_used_price normalized_new_price
count 3454.000000 3275.000000 3452.000000 3450.000000 3450.000000 3448.000000 3447.000000 3454.000000 3454.000000 3454.000000 3454.000000
mean 13.713115 9.460208 6.554229 54.573099 4.036122 3133.402697 182.751871 2015.965258 674.869716 4.364712 5.233107
std 3.805280 4.815461 6.970372 84.972371 1.365105 1299.682844 88.413228 2.298455 248.580166 0.588914 0.683637
min 5.080000 0.080000 0.000000 0.010000 0.020000 500.000000 69.000000 2013.000000 91.000000 1.536867 2.901422
25% 12.700000 5.000000 2.000000 16.000000 4.000000 2100.000000 142.000000 2014.000000 533.500000 4.033931 4.790342
50% 12.830000 8.000000 5.000000 32.000000 4.000000 3000.000000 160.000000 2015.500000 690.500000 4.405133 5.245892
75% 15.340000 13.000000 8.000000 64.000000 4.000000 4000.000000 185.000000 2018.000000 868.750000 4.755700 5.673718
max 30.710000 48.000000 32.000000 1024.000000 12.000000 9720.000000 855.000000 2020.000000 1094.000000 6.619433 7.847841

Observations:¶

  • There are 34 unique values in brand name, and 4 unique values in os.
  • There are only 2 unique values for 4g and 5g, so the only values for these columns are yes and no.
  • main_camera_mp ranges from 0.08 megapixels to 48 megapixels, and selfie_cameria_mp ranges from 0 megapixels to 32 megapixels.
  • int_memory ranges from 0.01 GB to 1,024 GB, and ram ranges from 0.02 GB to 12 GB.
  • battery ranges from 500 mAh to 9,720 mAh, and weight ranges from 69 grams to 855 grams.
  • screen_size ranges from 5.08 cm to 30.71 cm, and days_used ranges from 91 days to 1,094 days.
  • The earliest release_year is 2013 and the latest is 2020.
  • The normalized_used_price ranges from about 1.54 euros to 6.62 euros, and the normalized_new_prices ranges from about 2.90 euros to 7.85 euros.
In [9]:
#viewing missing values in the devices dataframe
devices.isnull().sum()
Out[9]:
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64

Observations:¶

  • There are 179 missing values in the main_camera_mp column, and 2 missing values in the selfie_camera_mp column.
  • There are 4 missing values each in the int_memory column and in the ram column.
  • There are 6 missing values in the battery column, and 7 missing values in the weight column.
In [10]:
#viewing duplicate values in the devices dataframe
devices.duplicated().sum()
Out[10]:
0

Observations:¶

There are no duplicate values.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

In [11]:
#creating histogram for brand_name column to examine distribution of brand_name/color is set to thistle
sns.histplot(data=devices, x='brand_name', color='thistle') #data from devices is used to plot brand_name on x-axis
plt.xticks(rotation=90, size=9); #x-axis labels are rotated 90 degrees and their size is set to 9
plt.title('brand_name') #title is set to brand_name (column name)
plt.show(); #displaying graph

Observations:¶

  • There are more devices that belong to the others category than to a specified brand.
  • The second highest number of devices can be seen in the brand Samsung.
  • Infinix has the lowest number of used devices in this dataframe.
In [12]:
#creating histogram for each of the other columns in the devices dataframe
for i in devices.columns:  #for loop for each column in devices
    if (i=='brand_name'): #condition for if the column is brand_name
        pass #skipping brand_name because it is already plotted above
    else: #condition for all of the other columns/color is set to thistle
        sns.histplot(data=devices, x=i, color='thistle') #using data from devices to plot the column on x-axis
        plt.title(i) #title of graph is set to column name
        plt.show(); #displaying graph

Observations:¶

  • Most of the devices in this dataframe have an Android operating system.
  • There are more devices that have a screen size close to 13 cm.
  • There are more devices that are 4g.
  • More devices have a main camera that is close to 13 megapixels than other sizes.
  • There are more devices with a selfie camera that is 5 megapixels compared to other sizes.
  • Most of the devices have an internal memory that is lower than 200 GB.
  • Most of the devices have a RAM of 4 GB.
  • The data for battery varies more and it is slightly skewed to the right.
  • The data for weight is skewed to the right.
  • There are more devices that were released in 2014 than any other year.
  • The data for days used is slightly skewed to the left because there are more devices that were used for 600 days or more.
  • Normalized new price and normalized used price both seem to follow a normal distribution, but the normalized new price is slightly higher than normalized used price which makes sense because the price for used devices is generally lower than new device price.

Bivariate Analysis¶

In [13]:
#assigning all_columns variable for all the variables except brand_name
all_columns = devices.columns.drop('brand_name')
#using one pair grid for all columns would be overwhelming
#first few columns used in the grid are assigned a new variable first_var
first_var = devices.loc[:,['screen_size','main_camera_mp','selfie_camera_mp','int_memory','ram']]
#the other columns are assigned last_var variable
last_var = devices.loc[:,['battery','weight','release_year','days_used']]

#creating a pair grid using data from devices datafrane
#y-variables in the grid are all columns except brand_name
#x-variables are columns in first_var
#.map() function used with sns.scatterplot to make scatterplots for the columns in the pair grid
sns.PairGrid(data=devices, y_vars=all_columns, x_vars=first_var).map(sns.scatterplot)
plt.show(); #displaying the pair grid
In [14]:
#creating pair grid with data from devices dataframe
#y_variables are all columns except brand_name
#x-variables are columns in last_var
#.map() function used with sns.scatterplot to make scatterplot for the columns in the pair grid
sns.PairGrid(data=devices, y_vars=all_columns, x_vars=last_var).map(sns.scatterplot)
plt.show(); #displaying pair grid

Observations:¶

  • Devices with larger screen sizes seem to be higher in weight.
  • Devices with a larger battery size seem to have more weight.
  • Devices with a larger screen size also seem to have more weight.
  • Devices released in more recent years are used for less days than devices released in earlier years.
  • Devices that weight more seem to have higher normalized used and new prices.
  • Devices with larger battery sizes also seem to have higher normalized used and new prices.
In [15]:
#assigning a variable called devices_numerical to only the numerical columns in the devices dataframe
devices_numerical = devices.select_dtypes(np.number)

#creating heatmap with numerical columns in devices to show correlation in shades of blue color
#the number labels are shown with annot=True and the range is from -1 to 1
#labels are limited to 2 decimal spaces
sns.heatmap(devices_numerical.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='Blues')
plt.show(); #displaying heatmap

Observations:¶

  • Battery and screen size have a high positive correlation.
  • Weight and screen size have a high positive correlation.
  • Battery and weight are also very positively correlated.
  • Normalized used price and normalized new price have a high positive correlation.
  • The highest negative correlation can be seen in release year and days used.

Questions:¶

1) What does the distribution of normalized used device prices look like?

In [16]:
#creating histogram using data from devices/plotting normalized_used_price on x-axis
#density curve is shown using kde=True, and color is set to light steel blue
sns.histplot(data=devices, x='normalized_used_price', kde=True, color='lightsteelblue')
plt.title('Distribution of Normalized Used Device Prices') #setting title of histogram
plt.xlabel('Normalized Used Device Price') #setting title of x-axis
plt.ylabel('Number of Devices') #setting title of y-axis
plt.show(); #displaying histogram

Observations:¶

  • The distribution of normalized used device price seems to follow a normal distribution because it takes the form of a bell-shaped curve.
  • However, the data seems to be very slightly skewed to the left.

2) What percentage of the used device market is dominated by Android devices?

In [17]:
#creating histogram using data from devices with os on the x-axis
#color is set to light steel blue
sns.histplot(data=devices, x='os',color='lightsteelblue')
plt.title('Operating System for Used Devices') #setting title of histogram
plt.xlabel('Operating System') #setting title of x-axis
plt.ylabel('Number of Devices') #setting title of y-axis
plt.show(); #displaying histogram
In [18]:
#calculating the percent of android devices
#finding the number of rows where the os column in devices is android using shape[0]
#and dividing that number by the total number of os columns in devices
#multiplying that number by 100 to get the percentage
#assigning the percentage to a variable called percent_android
percent_android = (devices[devices['os'] == 'Android'].shape[0]/devices['os'].shape[0])*100

#printing the value of percent_android
print(percent_android)
93.05153445280834

Observations:¶

93% of the used devices have an Android operating system.

3)The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?

In [19]:
#creating scatterplot using data from devices/plotting brand_name on x-axis and ram on y-axis
#color is set to light steel blue
sns.scatterplot(data=devices, x='brand_name', y='ram',color='lightsteelblue')
plt.xticks(rotation=90,size=9) #x-axis labels are rotated 90 degrees and size is set to 9 to avoid overcrowding
plt.title('RAM vs. Brand Name') #setting title of scatterplot
plt.xlabel('Brand Name') #setting title of x-axis
plt.ylabel('RAM') #setting title of y-axis
plt.show(); #displaying scatterplot
In [20]:
#creating histogram using data from devices and placing ram on x-axis
#color is set to light steel blue
sns.histplot(data=devices, x='ram', color='lightsteelblue')
plt.title('Distribution of RAM') #setting title of histogram
plt.xlabel('RAM') #setting title of x-axis
plt.ylabel('Number of Devices') #setting title of y-axis
plt.show(); #displaying histogram

Observations:¶

  • A high majority of the used devices have a RAM of 4 GB.
  • The RAM does not seem to vary much based on different brands.
  • The devices that some brands have fall within a larger range of RAM than other brands, but the RAM of 4 GB seems to be evenly distributed among the brands because the dots at 4 GB are spaced consistently.

4) A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?

In [21]:
#creating scatterplot using data from devices/plotting battery on x-axis and weight on y-axis
#color is set to light steel blue
sns.scatterplot(data=devices,x='battery', y='weight', color='lightsteelblue')
plt.title('Weight vs. Battery') #setting title of scatterplot
plt.xlabel('Battery') #setting title of x-axis
plt.ylabel('Weight') #setting title of y-axis
plt.show(); #displaying scatterplot

Observations:¶

  • Devices with larger battery size (more than 4500 mAh) seem to have more weight than devices with smaller battery size.
  • The range of weight tends to get more narrower for devices that have a battery size of over 4500 mAh. The weight varies less than devices with smaller battery size.

5) Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?

In [22]:
#converting cm to 6 inches
#1 in = 2.54 cm
#6 in = 2.54(6) cm
six_inches = 2.54*6

#assigning a variable called large screens to devices where screen_size is greater than 6 inches
large_screens = devices[devices['screen_size'] > six_inches ]

#displaying the number of screen_size values there are in the large_screens variable using .count() function
#organizing the number of screen_size values by brand_name using .groupby(by=['brand_name']) function
#organizing these by descending order using .sort_values(ascending=False) function
large_screens.groupby(by=['brand_name'])['screen_size'].count().sort_values(ascending=False)
Out[22]:
brand_name
Huawei       149
Samsung      119
Others        99
Vivo          80
Honor         72
Oppo          70
Xiaomi        69
Lenovo        69
LG            59
Motorola      42
Asus          41
Realme        40
Alcatel       26
Apple         24
Acer          19
ZTE           17
Meizu         17
OnePlus       16
Nokia         15
Sony          12
Infinix       10
HTC            7
Micromax       7
Google         4
Gionee         3
XOLO           3
Coolpad        3
Karbonn        2
Panasonic      2
Spice          2
Microsoft      1
Name: screen_size, dtype: int64

Observations:¶

  • Huawei has the most used devices with larger screen size.
  • Samsung has the next highest amount of large screen used devices.
  • Microsoft has only one large screen used device.

6) A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?

In [23]:
#assigning variable called large_selfie_camera to devices where selfie_camera_mp is larger than 8
large_selfie_camera = devices[devices['selfie_camera_mp'] > 8]

#finding the total number of selfie_camera_mp values in large_selfie_camera using .count() function
#organizing those values by brand_name using .groupby(by='brand_name') function
#organizing those values into descending order using .sort_values(ascending=False) function
large_selfie_camera.groupby(by = ['brand_name'])['selfie_camera_mp'].count().sort_values(ascending=False)
Out[23]:
brand_name
Huawei        87
Vivo          78
Oppo          75
Xiaomi        63
Samsung       57
Honor         41
Others        34
LG            32
Motorola      26
Meizu         24
ZTE           20
HTC           20
OnePlus       18
Realme        18
Sony          14
Lenovo        14
Nokia         10
Asus           6
Infinix        4
Gionee         4
Coolpad        3
Micromax       2
Panasonic      2
BlackBerry     2
Acer           1
Name: selfie_camera_mp, dtype: int64

Observations:¶

  • Huawei has the most number of used devices with a large selfie camera.
  • Vivo has the second highest number of used devices with a large selfie camera.
  • Acer has only one used device with a large selfie camera.

7) Which attributes are highly correlated with the normalized price of a used device?

In [24]:
#creating heatmap showing correlation of the perviously created variable devices_numerical (only numerical columns in devices dataframe)
#labels are shown with annot=true, and correlation values are from -1 to 1
#labels are limited to 2 decimal points and color is in shades of blue
sns.heatmap(devices_numerical.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='Blues')
plt.show(); #displaying heatmap

Observations:¶

  • Screen size and battery are highly positively correlated with normalized_used_price.
  • Selfie Camera MP is also highly positively correlated with normalized_used_price.
  • normalized_new_price is also highly positively correlated with normalized_new_price which makes sense because devices with higher new prices also tend to also have higher used prices.

Data Preprocessing¶

Missing Value Treatment¶

In [25]:
#viewing missing values in devices
devices.isnull().sum()
Out[25]:
brand_name                 0
os                         0
screen_size                0
4g                         0
5g                         0
main_camera_mp           179
selfie_camera_mp           2
int_memory                 4
ram                        4
battery                    6
weight                     7
release_year               0
days_used                  0
normalized_used_price      0
normalized_new_price       0
dtype: int64

Observations:¶

  • main_camera_mp has 179 missing values.
  • selfie_camera_mp has 2 missing values.
  • int_memory has 4 missing values.
  • ram has 4 missing values.
  • battery has 6 missing values.
  • weight has 7 missing values.
  • Since the above distributions of these columns are skewed to some extent in the univariate analysis, it is best to fill the null values with the median of these columns based on their brand_name.
In [26]:
#assigning a new variable to the dataframe copy because values need to be filled in without changing the original dataset
devices1 = devices.copy()
In [27]:
#filling in the null values of the main_camera_mp column in devices1 dataframe with the median of the column values when they are organized by brand_name
devices1['main_camera_mp'] = devices1['main_camera_mp'].fillna(value=devices1.groupby(by=['brand_name'])['main_camera_mp'].transform('median'))
#filling null values of selfie_camera_mp column with median of column values when organized by brand_name
devices1['selfie_camera_mp'] = devices1['selfie_camera_mp'].fillna(value=devices1.groupby(by=['brand_name'])['selfie_camera_mp'].transform('median'))
#filling null values of int_memory column with median of column values when organized by brand_name
devices1['int_memory'] = devices1['int_memory'].fillna(value=devices1.groupby(by=['brand_name'])['int_memory'].transform('median'))
#filling null values of ram column with median of column values when organized by brand_name
devices1['ram'] = devices1['ram'].fillna(value=devices1.groupby(by=['brand_name'])['ram'].transform('median'))
#filling null values of battery column with median of column values when organized by brand_name
devices1['battery'] = devices1['battery'].fillna(value=devices1.groupby(by=['brand_name'])['battery'].transform('median'))
#filling null values of weight column with median of column values when organized by brand_name
devices1['weight']=devices1['weight'].fillna(value=devices1.groupby(by=['brand_name'])['weight'].transform('median'))
In [28]:
#viewing the sum of null values in each column of devices1
devices1.isnull().sum()
Out[28]:
brand_name                0
os                        0
screen_size               0
4g                        0
5g                        0
main_camera_mp           10
selfie_camera_mp          0
int_memory                0
ram                       0
battery                   0
weight                    0
release_year              0
days_used                 0
normalized_used_price     0
normalized_new_price      0
dtype: int64

Observations:¶

  • There are still 10 null values in the main_camera_mp column of the devices1 dataframe.
In [29]:
#locating the null values in the main_camera_mp column of devices 1
devices1.loc[devices1['main_camera_mp'].isnull()]
Out[29]:
brand_name os screen_size 4g 5g main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used normalized_used_price normalized_new_price
59 Infinix Android 17.32 yes no NaN 8.0 32.0 2.0 6000.0 209.0 2020 245 4.282068 4.597642
60 Infinix Android 15.39 yes no NaN 8.0 64.0 4.0 5000.0 185.0 2020 173 4.363608 4.711780
61 Infinix Android 15.39 yes no NaN 8.0 32.0 2.0 5000.0 185.0 2020 256 4.181439 4.505460
62 Infinix Android 15.39 yes no NaN 16.0 32.0 3.0 4000.0 178.0 2019 316 4.555244 4.602166
63 Infinix Android 15.29 yes no NaN 16.0 32.0 2.0 4000.0 165.0 2019 468 4.416670 4.871297
278 Infinix Android 17.32 yes no NaN 8.0 32.0 2.0 6000.0 209.0 2020 320 4.405133 4.605370
279 Infinix Android 15.39 yes no NaN 8.0 64.0 4.0 5000.0 185.0 2020 173 4.495913 4.702115
280 Infinix Android 15.39 yes no NaN 8.0 32.0 2.0 5000.0 185.0 2020 329 4.370713 4.487287
281 Infinix Android 15.39 yes no NaN 16.0 32.0 3.0 4000.0 178.0 2019 356 4.417997 4.605970
282 Infinix Android 15.29 yes no NaN 16.0 32.0 2.0 4000.0 165.0 2019 497 4.423289 4.866072

Observations:¶

  • Infinix used devices have no values for main_camera_mp, so there is no median for them to be replaced with.
  • All of the other columns for Inifnix have values, so dropping this column might not be the best idea.
  • It is best to replace them with the median values of the column organized by operating system.
In [30]:
#filling in the remaining null values of the main_camera_mp column in devices1 using median values of the column when organized by os
devices1['main_camera_mp'] = devices1['main_camera_mp'].fillna(value=devices1.groupby(by=['os'])['main_camera_mp'].transform('median'))
In [31]:
#viewing the sum of null values in each column of devices1
devices1.isnull().sum()
Out[31]:
brand_name               0
os                       0
screen_size              0
4g                       0
5g                       0
main_camera_mp           0
selfie_camera_mp         0
int_memory               0
ram                      0
battery                  0
weight                   0
release_year             0
days_used                0
normalized_used_price    0
normalized_new_price     0
dtype: int64

Observations:¶

  • There are no null values now in the devices1 dataframe.

Feature Engineering is not needed for this dataset.

Outlier Detection and Treatment¶

In [32]:
#creating a variable called devices1_numerical to contain only the numerical columns of devices1
devices1_numerical = devices1.select_dtypes(np.number)

#making boxplot for each numerical variable
for i in devices1_numerical.columns: #using for loop for each column in devices1_numerical
    sns.boxplot(data=devices1_numerical, x=i) #making a boxplot for each variable using data from devices1_numerical
    plt.title(i) #setting title of boxplot
    plt.show(); #displaying boxplot

Observations:¶

  • There are outliers in the data, however they are important to keep.
  • They represent accurate information, so they will not be treated.

Preparing Data for Modeling¶

In [33]:
#the main variable in focus in the normalized_used_price, so that is Y
#X will hold the other variables
#X will be all the columns when the normalized_used_price column is dropped
X = devices1.drop(['normalized_used_price'], axis=1)
#y is the normalized_used_price column of the devices dataframe
y = devices1['normalized_used_price']
In [34]:
#viewing X
X
Out[34]:
brand_name os screen_size 4g 5g main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used normalized_new_price
0 Honor Android 14.50 yes no 13.0 5.0 64.0 3.0 3020.0 146.0 2020 127 4.715100
1 Honor Android 17.30 yes yes 13.0 16.0 128.0 8.0 4300.0 213.0 2020 325 5.519018
2 Honor Android 16.69 yes yes 13.0 8.0 128.0 8.0 4200.0 213.0 2020 162 5.884631
3 Honor Android 25.50 yes yes 13.0 8.0 64.0 6.0 7250.0 480.0 2020 345 5.630961
4 Honor Android 15.32 yes no 13.0 8.0 64.0 3.0 5000.0 185.0 2020 293 4.947837
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3449 Asus Android 15.34 yes no 13.0 8.0 64.0 6.0 5000.0 190.0 2019 232 6.483872
3450 Asus Android 15.24 yes no 13.0 8.0 128.0 8.0 4000.0 200.0 2018 541 6.251538
3451 Alcatel Android 15.80 yes no 13.0 5.0 32.0 3.0 4000.0 165.0 2020 201 4.528829
3452 Alcatel Android 15.80 yes no 13.0 5.0 32.0 2.0 4000.0 160.0 2020 149 4.624188
3453 Alcatel Android 12.83 yes no 13.0 5.0 16.0 2.0 4000.0 168.0 2020 176 4.279994

3454 rows × 14 columns

In [35]:
#viewing y
y
Out[35]:
0       4.307572
1       5.162097
2       5.111084
3       5.135387
4       4.389995
          ...   
3449    4.492337
3450    5.037732
3451    4.357350
3452    4.349762
3453    4.132122
Name: normalized_used_price, Length: 3454, dtype: float64

Observations:¶

  • The X and y variables have been created.
  • X holds every column other than normalized_used_price and y holds normalized_used_price.
In [36]:
#using sm.add_constant(X) function to add an intercept
X = sm.add_constant(X)

#adding dummy variables for the 4 categorical (object) variables in devices1 (brand_name, os, 4g, and 5g)
#first variable in the dummy variables is dropped for each object variable using drop_first=True
X = pd.get_dummies(X, columns = X.select_dtypes(include=['object']).columns.tolist(), drop_first=True)

#viewing X to make sure the dummy variables have been created
X
Out[36]:
const screen_size main_camera_mp selfie_camera_mp int_memory ram battery weight release_year days_used ... brand_name_Spice brand_name_Vivo brand_name_XOLO brand_name_Xiaomi brand_name_ZTE os_Others os_Windows os_iOS 4g_yes 5g_yes
0 1.0 14.50 13.0 5.0 64.0 3.0 3020.0 146.0 2020 127 ... 0 0 0 0 0 0 0 0 1 0
1 1.0 17.30 13.0 16.0 128.0 8.0 4300.0 213.0 2020 325 ... 0 0 0 0 0 0 0 0 1 1
2 1.0 16.69 13.0 8.0 128.0 8.0 4200.0 213.0 2020 162 ... 0 0 0 0 0 0 0 0 1 1
3 1.0 25.50 13.0 8.0 64.0 6.0 7250.0 480.0 2020 345 ... 0 0 0 0 0 0 0 0 1 1
4 1.0 15.32 13.0 8.0 64.0 3.0 5000.0 185.0 2020 293 ... 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3449 1.0 15.34 13.0 8.0 64.0 6.0 5000.0 190.0 2019 232 ... 0 0 0 0 0 0 0 0 1 0
3450 1.0 15.24 13.0 8.0 128.0 8.0 4000.0 200.0 2018 541 ... 0 0 0 0 0 0 0 0 1 0
3451 1.0 15.80 13.0 5.0 32.0 3.0 4000.0 165.0 2020 201 ... 0 0 0 0 0 0 0 0 1 0
3452 1.0 15.80 13.0 5.0 32.0 2.0 4000.0 160.0 2020 149 ... 0 0 0 0 0 0 0 0 1 0
3453 1.0 12.83 13.0 5.0 16.0 2.0 4000.0 168.0 2020 176 ... 0 0 0 0 0 0 0 0 1 0

3454 rows × 49 columns

Observations:¶

  • Dummy variables have been created for brand_name, os, 4g, and 5g columns in devices1.
In [37]:
#dividing the data into train and test groups
#test_size is set to 0.3 to make 70:30 train to test ratio
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
In [38]:
#viewing the shape of x_train
x_train.shape[0]
Out[38]:
2417
In [39]:
#viewing the shape of x_test 
x_test.shape[0]
Out[39]:
1037

Observations:¶

  • The data has been split into a 70:30 train to test ratio.

After Data Preprocessing: Exploratory Data Analysis (EDA)¶

The columns main_camera_mp, selfie_camera_mp, int_memory, ram, battery, and weight have been manipulated as the null values have been replaced. It is best to make sure that this did not make changes to the distribution of data.

In [40]:
#the columns where null values have been replaced are all numerical
#viewing histograms for the columns in the previously created variable devices1_numerical
#color is set to thistle
for i in devices1_numerical.columns: #for loop for each column in devices1_numerical
    sns.histplot(data=devices1_numerical, x=i, color='thistle') #using data from devices1_numerical to plot variable on x-axis
    plt.title(i) #setting title of histogram
    plt.show(); #displaying histogram

Observations:¶

There are no important changes in the distribution of data for the columns that have been manipulated.

In [41]:
#viewing any changes in correlation by using a heatmap
#using annot=True to display labels and range is from -1 to 1
#decimals are limited to 2 spaces, and color is in shades of blue
sns.heatmap(devices1_numerical.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='Blues')
plt.show(); #displaying heatmap

Observations:¶

  • There are no important changes in the correlation between numerical columns in devices1.

Model Building - Linear Regression¶

In [42]:
#building the linear regression model as olsmodel
olsmodel = sm.OLS(y_train, x_train).fit()
#printing the summary of the linear regression model
print(olsmodel.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.845
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     268.8
Date:                   Thu, 16 Nov 2023   Prob (F-statistic):               0.00
Time:                           21:37:08   Log-Likelihood:                 124.15
No. Observations:                   2417   AIC:                            -150.3
Df Residuals:                       2368   BIC:                             133.4
Df Model:                             48                                         
Covariance Type:               nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                   -46.4646      9.197     -5.052      0.000     -64.500     -28.430
screen_size               0.0244      0.003      7.156      0.000       0.018       0.031
main_camera_mp            0.0208      0.002     13.848      0.000       0.018       0.024
selfie_camera_mp          0.0135      0.001     11.996      0.000       0.011       0.016
int_memory                0.0001   6.97e-05      1.664      0.096   -2.07e-05       0.000
ram                       0.0232      0.005      4.515      0.000       0.013       0.033
battery               -1.686e-05   7.27e-06     -2.318      0.021   -3.11e-05    -2.6e-06
weight                    0.0010      0.000      7.488      0.000       0.001       0.001
release_year              0.0236      0.005      5.189      0.000       0.015       0.033
days_used              4.196e-05   3.09e-05      1.360      0.174   -1.85e-05       0.000
normalized_new_price      0.4309      0.012     35.134      0.000       0.407       0.455
brand_name_Alcatel        0.0154      0.048      0.324      0.746      -0.078       0.109
brand_name_Apple         -0.0032      0.147     -0.021      0.983      -0.292       0.285
brand_name_Asus           0.0150      0.048      0.313      0.754      -0.079       0.109
brand_name_BlackBerry    -0.0297      0.070     -0.423      0.672      -0.167       0.108
brand_name_Celkon        -0.0463      0.066     -0.699      0.484      -0.176       0.084
brand_name_Coolpad        0.0209      0.073      0.286      0.775      -0.122       0.164
brand_name_Gionee         0.0447      0.058      0.775      0.438      -0.068       0.158
brand_name_Google        -0.0327      0.085     -0.386      0.700      -0.199       0.133
brand_name_HTC           -0.0131      0.048     -0.271      0.786      -0.108       0.081
brand_name_Honor          0.0316      0.049      0.642      0.521      -0.065       0.128
brand_name_Huawei        -0.0022      0.044     -0.049      0.961      -0.089       0.085
brand_name_Infinix        0.0801      0.093      0.859      0.390      -0.103       0.263
brand_name_Karbonn        0.0943      0.067      1.406      0.160      -0.037       0.226
brand_name_LG            -0.0132      0.045     -0.292      0.771      -0.102       0.076
brand_name_Lava           0.0332      0.062      0.533      0.594      -0.089       0.155
brand_name_Lenovo         0.0453      0.045      1.003      0.316      -0.043       0.134
brand_name_Meizu         -0.0130      0.056     -0.232      0.817      -0.123       0.097
brand_name_Micromax      -0.0337      0.048     -0.704      0.481      -0.128       0.060
brand_name_Microsoft      0.0947      0.088      1.072      0.284      -0.079       0.268
brand_name_Motorola      -0.0113      0.050     -0.228      0.820      -0.109       0.086
brand_name_Nokia          0.0705      0.052      1.362      0.173      -0.031       0.172
brand_name_OnePlus        0.0707      0.077      0.913      0.361      -0.081       0.222
brand_name_Oppo           0.0124      0.048      0.259      0.796      -0.081       0.106
brand_name_Others        -0.0080      0.042     -0.191      0.849      -0.091       0.074
brand_name_Panasonic      0.0562      0.056      1.006      0.314      -0.053       0.166
brand_name_Realme         0.0319      0.062      0.517      0.605      -0.089       0.153
brand_name_Samsung       -0.0314      0.043     -0.726      0.468      -0.116       0.053
brand_name_Sony          -0.0616      0.050     -1.221      0.222      -0.161       0.037
brand_name_Spice         -0.0148      0.063     -0.234      0.815      -0.139       0.109
brand_name_Vivo          -0.0155      0.048     -0.320      0.749      -0.111       0.080
brand_name_XOLO           0.0151      0.055      0.276      0.783      -0.092       0.123
brand_name_Xiaomi         0.0868      0.048      1.804      0.071      -0.008       0.181
brand_name_ZTE           -0.0058      0.047     -0.122      0.903      -0.099       0.087
os_Others                -0.0519      0.033     -1.585      0.113      -0.116       0.012
os_Windows               -0.0202      0.045     -0.448      0.654      -0.109       0.068
os_iOS                   -0.0669      0.146     -0.457      0.648      -0.354       0.220
4g_yes                    0.0530      0.016      3.341      0.001       0.022       0.084
5g_yes                   -0.0721      0.031     -2.292      0.022      -0.134      -0.010
==============================================================================
Omnibus:                      223.220   Durbin-Watson:                   1.911
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              422.514
Skew:                          -0.618   Prob(JB):                     1.79e-92
Kurtosis:                       4.633   Cond. No.                     7.70e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.7e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Model Performance Check¶

In [43]:
#assigning the R-squared of the model to r_squared variable
r_squared = olsmodel.rsquared
In [44]:
#assigning the Adjusted R-squared of the model to adjusted_r_squared variable
adjusted_r_squared = olsmodel.rsquared_adj
In [45]:
#assigning Mean Absolute Error for training set to training_mae variable
training_mae = mean_absolute_error(y_train, olsmodel.predict(x_train))
#assigning Mean Absolute Error for testing set to testing_mae variable
testing_mae = mean_absolute_error(y_test, olsmodel.predict(x_test))
In [46]:
#assigning Mean Square Error for training set to training_mse variable
training_mse = mean_squared_error(y_train, olsmodel.predict(x_train))
#assigning Mean Square Error for testing set to testing_mse variable
testing_mse = mean_squared_error(y_test, olsmodel.predict(x_test))
In [47]:
#assigning Root Mean Square Error for training set to training_rmse variable
training_rmse = np.sqrt(training_mse)
#assigning Root Mean Square Error for testing set to testing_rmse variable
testing_rmse = np.sqrt(testing_mse)
In [48]:
#creating a dataframe with all of the above performance metrics
#creating small lists of performance metric labels and the corresponding variables from above to make one whole list called performance
performance = [['R-squared', r_squared], ['Adjusted R-squared', adjusted_r_squared], ['Training MAE', training_mae],
              ['Testing MAE', testing_mae], ['Training MSE', training_mse],['Testing MSE', testing_mse],
              ['Training RMSE', training_rmse], ['Testing RMSE', testing_rmse]]
#creating a dataframe from the performance list and adding titles to the columns
model_performance = pd.DataFrame(performance, columns = ['Performance Metric', 'Value'])
model_performance #dataframe with the metrics and corresponding values
Out[48]:
Performance Metric Value
0 R-squared 0.844924
1 Adjusted R-squared 0.841780
2 Training MAE 0.180302
3 Testing MAE 0.184868
4 Training MSE 0.052834
5 Testing MSE 0.056874
6 Training RMSE 0.229856
7 Testing RMSE 0.238482

Observations:¶

  • R-squared is 0.845, so 84.5% of the variance in the training set is measured by this model.
  • Adjusted R-squared is 0.842, so this model is a good fit.
  • The R-squared and Adjusted R-squared show that the model is not an underfit.
  • The Mean Absolute Error is similar for the training and testing set.
  • The Mean Sqaured Error is similar for the training and testing set.
  • The Root Mean Squared Error is similar for the training and testing set.
  • This shows that the model is a good fit and not an overfit.
  • Mean Absolute Error is 0.18, so this shows that there is a 0.18 error in the model's ability to predict normalized_used_device prices on the test set.
  • The Mean Squared Error and Root Mean Squared Error are small, which show that the predicted values in the line of regression are close to the actual data values.

Checking Linear Regression Assumptions¶

Multicollinearity¶

In [49]:
#importing the variaition_inflation_factor function for calculating VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

#creating a dataframe called vif 
#one column called Variables holds the columns in x.train
#one column called Values holds the variation inflation factor for the corresponding columns
#using for loop to calculate vif with variation_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i) 
             for i in range(len(x_train.columns))] 

#to avoid displaying scientific notation for large numbers, the format is set to 6 decimal spaces
pd.options.display.float_format = '{:.6f}'.format 
vif #displaying vif dataframe
Out[49]:
Variables Values
0 const 3791081.976721
1 screen_size 7.677118
2 main_camera_mp 2.285811
3 selfie_camera_mp 2.810716
4 int_memory 1.364057
5 ram 2.255246
6 battery 4.081715
7 weight 6.397183
8 release_year 4.898360
9 days_used 2.660002
10 normalized_new_price 3.119569
11 brand_name_Alcatel 3.405674
12 brand_name_Apple 13.054832
13 brand_name_Asus 3.332003
14 brand_name_BlackBerry 1.632250
15 brand_name_Celkon 1.774820
16 brand_name_Coolpad 1.467981
17 brand_name_Gionee 1.951247
18 brand_name_Google 1.321771
19 brand_name_HTC 3.410252
20 brand_name_Honor 3.340621
21 brand_name_Huawei 5.983857
22 brand_name_Infinix 1.285635
23 brand_name_Karbonn 1.573683
24 brand_name_LG 4.849589
25 brand_name_Lava 1.711317
26 brand_name_Lenovo 4.558847
27 brand_name_Meizu 2.179607
28 brand_name_Micromax 3.363518
29 brand_name_Microsoft 1.869558
30 brand_name_Motorola 3.274455
31 brand_name_Nokia 3.473140
32 brand_name_OnePlus 1.437047
33 brand_name_Oppo 3.971065
34 brand_name_Others 9.710921
35 brand_name_Panasonic 2.105711
36 brand_name_Realme 1.946675
37 brand_name_Samsung 7.539832
38 brand_name_Sony 2.943127
39 brand_name_Spice 1.688868
40 brand_name_Vivo 3.651320
41 brand_name_XOLO 2.138074
42 brand_name_Xiaomi 3.719678
43 brand_name_ZTE 3.797527
44 os_Others 1.854134
45 os_Windows 1.595291
46 os_iOS 11.780766
47 4g_yes 2.468374
48 5g_yes 1.811042
In [50]:
#the highest VIF is seen in the brand_name_Apple column
#creating new set called x_train1 by dropping brand_name_Apple column from x_train set
x_train1 = x_train.drop(['brand_name_Apple'],axis = 1) 
olsmodel1 = sm.OLS(y_train, x_train1).fit() #building a linear regression model called olsmodel1 using x_train1 with y_train
print('R-squared', olsmodel1.rsquared) #printing R-squared for model
print('Adjusted R-squared', olsmodel1.rsquared_adj) #printing Adjusted R-squared for model
R-squared 0.8449234815864929
Adjusted R-squared 0.8418468263034896

Observations:¶

brand_name_Apple has the highest VIF when compared to other variables, and dropping it does not have a much of an impact on the R-squared and Adjusted R-squared.

In [51]:
#removing the brand_name_Apple column from original x_train
x_train = x_train.drop(['brand_name_Apple'], axis=1)
#removing brand_name column from x_test as well
x_test = x_test.drop(['brand_name_Apple'], axis=1)
#building a new model called olsmodel2 for the modified x_train with y_train
olsmodel2 = sm.OLS(y_train, x_train).fit()
print(olsmodel2.summary()) #printing the summary of olsmodel2
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.845
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     274.6
Date:                   Thu, 16 Nov 2023   Prob (F-statistic):               0.00
Time:                           21:37:09   Log-Likelihood:                 124.15
No. Observations:                   2417   AIC:                            -152.3
Df Residuals:                       2369   BIC:                             125.6
Df Model:                             47                                         
Covariance Type:               nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                   -46.4564      9.187     -5.057      0.000     -64.472     -28.441
screen_size               0.0244      0.003      7.172      0.000       0.018       0.031
main_camera_mp            0.0208      0.002     13.852      0.000       0.018       0.024
selfie_camera_mp          0.0135      0.001     12.046      0.000       0.011       0.016
int_memory                0.0001   6.97e-05      1.664      0.096   -2.07e-05       0.000
ram                       0.0232      0.005      4.522      0.000       0.013       0.033
battery               -1.686e-05   7.27e-06     -2.320      0.020   -3.11e-05   -2.61e-06
weight                    0.0010      0.000      7.492      0.000       0.001       0.001
release_year              0.0236      0.005      5.194      0.000       0.015       0.033
days_used              4.197e-05   3.08e-05      1.361      0.174   -1.85e-05       0.000
normalized_new_price      0.4309      0.012     35.238      0.000       0.407       0.455
brand_name_Alcatel        0.0157      0.046      0.338      0.735      -0.075       0.107
brand_name_Asus           0.0153      0.047      0.328      0.743      -0.076       0.107
brand_name_BlackBerry    -0.0294      0.069     -0.428      0.668      -0.164       0.105
brand_name_Celkon        -0.0461      0.065     -0.705      0.481      -0.174       0.082
brand_name_Coolpad        0.0211      0.072      0.293      0.770      -0.120       0.162
brand_name_Gionee         0.0450      0.057      0.792      0.428      -0.066       0.156
brand_name_Google        -0.0324      0.084     -0.387      0.699      -0.196       0.132
brand_name_HTC           -0.0129      0.047     -0.274      0.784      -0.105       0.079
brand_name_Honor          0.0318      0.048      0.666      0.506      -0.062       0.126
brand_name_Huawei        -0.0019      0.043     -0.045      0.964      -0.086       0.082
brand_name_Infinix        0.0803      0.093      0.868      0.385      -0.101       0.262
brand_name_Karbonn        0.0945      0.066      1.422      0.155      -0.036       0.225
brand_name_LG            -0.0130      0.044     -0.295      0.768      -0.099       0.073
brand_name_Lava           0.0334      0.062      0.542      0.588      -0.087       0.154
brand_name_Lenovo         0.0456      0.044      1.039      0.299      -0.040       0.132
brand_name_Meizu         -0.0127      0.055     -0.232      0.817      -0.120       0.095
brand_name_Micromax      -0.0335      0.047     -0.716      0.474      -0.125       0.058
brand_name_Microsoft      0.0949      0.087      1.085      0.278      -0.077       0.266
brand_name_Motorola      -0.0111      0.048     -0.229      0.819      -0.106       0.084
brand_name_Nokia          0.0708      0.050      1.412      0.158      -0.028       0.169
brand_name_OnePlus        0.0709      0.076      0.928      0.354      -0.079       0.221
brand_name_Oppo           0.0126      0.046      0.271      0.786      -0.078       0.104
brand_name_Others        -0.0078      0.041     -0.192      0.848      -0.088       0.072
brand_name_Panasonic      0.0564      0.055      1.029      0.304      -0.051       0.164
brand_name_Realme         0.0321      0.060      0.531      0.596      -0.087       0.151
brand_name_Samsung       -0.0311      0.042     -0.748      0.455      -0.113       0.050
brand_name_Sony          -0.0614      0.049     -1.247      0.212      -0.158       0.035
brand_name_Spice         -0.0146      0.063     -0.234      0.815      -0.137       0.108
brand_name_Vivo          -0.0152      0.047     -0.324      0.746      -0.108       0.077
brand_name_XOLO           0.0153      0.054      0.284      0.776      -0.090       0.121
brand_name_Xiaomi         0.0870      0.047      1.862      0.063      -0.005       0.179
brand_name_ZTE           -0.0056      0.046     -0.121      0.904      -0.096       0.085
os_Others                -0.0521      0.032     -1.645      0.100      -0.114       0.010
os_Windows               -0.0202      0.045     -0.449      0.653      -0.109       0.068
os_iOS                   -0.0698      0.059     -1.185      0.236      -0.185       0.046
4g_yes                    0.0530      0.016      3.342      0.001       0.022       0.084
5g_yes                   -0.0721      0.031     -2.296      0.022      -0.134      -0.011
==============================================================================
Omnibus:                      223.213   Durbin-Watson:                   1.911
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              422.547
Skew:                          -0.618   Prob(JB):                     1.76e-92
Kurtosis:                       4.633   Cond. No.                     7.69e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [52]:
#rechecking the variation inflation factor for the columns in x_train
#creating a dataframe called vif 
#one column called Variables holds the columns in x.train
#one column called Values holds the variation inflation factor for the corresponding columns
#using for loop to calculate vif with variation_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i) 
             for i in range(len(x_train.columns))] 
                    
vif #displaying vif dataframe
Out[52]:
Variables Values
0 const 3784495.420947
1 screen_size 7.643504
2 main_camera_mp 2.285201
3 selfie_camera_mp 2.789591
4 int_memory 1.364046
5 ram 2.247171
6 battery 4.079641
7 weight 6.394451
8 release_year 4.889730
9 days_used 2.659521
10 normalized_new_price 3.102357
11 brand_name_Alcatel 3.230621
12 brand_name_Asus 3.144923
13 brand_name_BlackBerry 1.561068
14 brand_name_Celkon 1.731859
15 brand_name_Coolpad 1.436792
16 brand_name_Gionee 1.886305
17 brand_name_Google 1.293140
18 brand_name_HTC 3.240385
19 brand_name_Honor 3.159809
20 brand_name_Huawei 5.581499
21 brand_name_Infinix 1.266884
22 brand_name_Karbonn 1.544215
23 brand_name_LG 4.564980
24 brand_name_Lava 1.670716
25 brand_name_Lenovo 4.291535
26 brand_name_Meizu 2.092794
27 brand_name_Micromax 3.214160
28 brand_name_Microsoft 1.835387
29 brand_name_Motorola 3.109598
30 brand_name_Nokia 3.258328
31 brand_name_OnePlus 1.402472
32 brand_name_Oppo 3.762609
33 brand_name_Others 9.075017
34 brand_name_Panasonic 2.030879
35 brand_name_Realme 1.878667
36 brand_name_Samsung 6.991243
37 brand_name_Sony 2.799041
38 brand_name_Spice 1.655390
39 brand_name_Vivo 3.447542
40 brand_name_XOLO 2.069721
41 brand_name_Xiaomi 3.512985
42 brand_name_ZTE 3.604237
43 os_Others 1.734615
44 os_Windows 1.593031
45 os_iOS 1.908364
46 4g_yes 2.467374
47 5g_yes 1.802632
In [53]:
#the highest VIF value is seen in the brand_name_Others column
#creating new set called x_train2 by dropping brand_name_Others column from x_train set
x_train2 = x_train.drop(['brand_name_Others'],axis = 1) 
olsmodel3 = sm.OLS(y_train, x_train2).fit() #building a linear regression model called olsmodel3 using x_train2 with y_train
print('R-squared', olsmodel3.rsquared) #printing R-squared for model
print('Adjusted R-squared', olsmodel3.rsquared_adj) #printing Adjusted R-squared for model
R-squared 0.8449210795185332
Adjusted R-squared 0.8419111089100322

Observations:¶

brand_name_Others has the next highest VIF when compared to the other variables, and dropping it does not have much of an impact on R-squared and Adjusted R-squared.

In [54]:
#removing the brand_name_Others column from original x_train
x_train = x_train.drop(['brand_name_Others'], axis=1)
#removing brand_name_Others column from x_test as well
x_test = x_test.drop(['brand_name_Others'], axis=1)
#building a new model called olsmodel4 for the modified x_train with y_train
olsmodel4 = sm.OLS(y_train, x_train).fit()
print(olsmodel4.summary()) #printing the summary of olsmodel4
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.845
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     280.7
Date:                   Thu, 16 Nov 2023   Prob (F-statistic):               0.00
Time:                           21:37:10   Log-Likelihood:                 124.13
No. Observations:                   2417   AIC:                            -154.3
Df Residuals:                       2370   BIC:                             117.9
Df Model:                             46                                         
Covariance Type:               nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                   -46.4460      9.185     -5.057      0.000     -64.458     -28.434
screen_size               0.0244      0.003      7.227      0.000       0.018       0.031
main_camera_mp            0.0208      0.002     13.855      0.000       0.018       0.024
selfie_camera_mp          0.0135      0.001     12.047      0.000       0.011       0.016
int_memory                0.0001   6.97e-05      1.661      0.097   -2.09e-05       0.000
ram                       0.0232      0.005      4.534      0.000       0.013       0.033
battery               -1.685e-05   7.27e-06     -2.319      0.020   -3.11e-05    -2.6e-06
weight                    0.0010      0.000      7.496      0.000       0.001       0.001
release_year              0.0236      0.005      5.193      0.000       0.015       0.033
days_used              4.203e-05   3.08e-05      1.363      0.173   -1.84e-05       0.000
normalized_new_price      0.4309      0.012     35.255      0.000       0.407       0.455
brand_name_Alcatel        0.0227      0.028      0.797      0.426      -0.033       0.078
brand_name_Asus           0.0223      0.029      0.775      0.439      -0.034       0.079
brand_name_BlackBerry    -0.0224      0.058     -0.386      0.700      -0.137       0.092
brand_name_Celkon        -0.0390      0.054     -0.724      0.469      -0.145       0.067
brand_name_Coolpad        0.0282      0.062      0.457      0.647      -0.093       0.149
brand_name_Gionee         0.0520      0.043      1.206      0.228      -0.033       0.137
brand_name_Google        -0.0254      0.075     -0.337      0.736      -0.173       0.122
brand_name_HTC           -0.0057      0.029     -0.200      0.842      -0.062       0.051
brand_name_Honor          0.0389      0.030      1.278      0.202      -0.021       0.099
brand_name_Huawei         0.0051      0.022      0.230      0.818      -0.039       0.049
brand_name_Infinix        0.0875      0.085      1.033      0.302      -0.079       0.254
brand_name_Karbonn        0.1015      0.055      1.833      0.067      -0.007       0.210
brand_name_LG            -0.0059      0.024     -0.248      0.804      -0.053       0.041
brand_name_Lava           0.0405      0.049      0.821      0.412      -0.056       0.137
brand_name_Lenovo         0.0526      0.024      2.180      0.029       0.005       0.100
brand_name_Meizu         -0.0056      0.040     -0.139      0.889      -0.085       0.074
brand_name_Micromax      -0.0264      0.029     -0.915      0.360      -0.083       0.030
brand_name_Microsoft      0.1021      0.079      1.294      0.196      -0.053       0.257
brand_name_Motorola      -0.0039      0.031     -0.128      0.898      -0.064       0.056
brand_name_Nokia          0.0779      0.034      2.301      0.021       0.011       0.144
brand_name_OnePlus        0.0780      0.067      1.162      0.245      -0.054       0.210
brand_name_Oppo           0.0197      0.028      0.702      0.483      -0.035       0.075
brand_name_Panasonic      0.0635      0.040      1.570      0.117      -0.016       0.143
brand_name_Realme         0.0392      0.048      0.823      0.410      -0.054       0.133
brand_name_Samsung       -0.0241      0.020     -1.220      0.222      -0.063       0.015
brand_name_Sony          -0.0543      0.032     -1.677      0.094      -0.118       0.009
brand_name_Spice         -0.0076      0.051     -0.149      0.881      -0.107       0.092
brand_name_Vivo          -0.0082      0.029     -0.280      0.779      -0.065       0.049
brand_name_XOLO           0.0223      0.040      0.564      0.573      -0.055       0.100
brand_name_Xiaomi         0.0941      0.029      3.293      0.001       0.038       0.150
brand_name_ZTE            0.0016      0.027      0.057      0.955      -0.052       0.055
os_Others                -0.0516      0.032     -1.635      0.102      -0.114       0.010
os_Windows               -0.0204      0.045     -0.453      0.650      -0.109       0.068
os_iOS                   -0.0628      0.046     -1.358      0.175      -0.154       0.028
4g_yes                    0.0529      0.016      3.337      0.001       0.022       0.084
5g_yes                   -0.0722      0.031     -2.302      0.021      -0.134      -0.011
==============================================================================
Omnibus:                      223.194   Durbin-Watson:                   1.911
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              422.387
Skew:                          -0.618   Prob(JB):                     1.90e-92
Kurtosis:                       4.633   Cond. No.                     7.69e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [55]:
#rechecking the variation inflation factor for the columns in x_train
#creating a dataframe called vif 
#one column called Variables holds the columns in x.train
#one column called Values holds the variation inflation factor for the corresponding columns
#using for loop to calculate vif with variation_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i) 
             for i in range(len(x_train.columns))] 
                    
vif #displaying vif dataframe
Out[55]:
Variables Values
0 const 3784363.592994
1 screen_size 7.570153
2 main_camera_mp 2.283728
3 selfie_camera_mp 2.787818
4 int_memory 1.363394
5 ram 2.243844
6 battery 4.079417
7 weight 6.368547
8 release_year 4.889201
9 days_used 2.659263
10 normalized_new_price 3.101257
11 brand_name_Alcatel 1.217785
12 brand_name_Asus 1.200109
13 brand_name_BlackBerry 1.123993
14 brand_name_Celkon 1.174580
15 brand_name_Coolpad 1.053575
16 brand_name_Gionee 1.090929
17 brand_name_Google 1.044519
18 brand_name_HTC 1.220945
19 brand_name_Honor 1.280959
20 brand_name_Huawei 1.500803
21 brand_name_Infinix 1.061884
22 brand_name_Karbonn 1.073813
23 brand_name_LG 1.344151
24 brand_name_Lava 1.071000
25 brand_name_Lenovo 1.298072
26 brand_name_Meizu 1.134630
27 brand_name_Micromax 1.227730
28 brand_name_Microsoft 1.494129
29 brand_name_Motorola 1.256765
30 brand_name_Nokia 1.485057
31 brand_name_OnePlus 1.079844
32 brand_name_Oppo 1.371004
33 brand_name_Panasonic 1.106442
34 brand_name_Realme 1.167394
35 brand_name_Samsung 1.576652
36 brand_name_Sony 1.212002
37 brand_name_Spice 1.080742
38 brand_name_Vivo 1.324005
39 brand_name_XOLO 1.118852
40 brand_name_Xiaomi 1.313746
41 brand_name_ZTE 1.263080
42 os_Others 1.723014
43 os_Windows 1.592335
44 os_iOS 1.176653
45 4g_yes 2.462358
46 5g_yes 1.801670
In [56]:
#the highest VIF value is seen in the screen_size column
#creating new set called x_train3 by dropping screen_size column from x_train set
x_train3 = x_train.drop(['screen_size'],axis = 1) 
olsmodel5 = sm.OLS(y_train, x_train2).fit() #building a linear regression model called olsmodel5 using x_train3 with y_train
print('R-squared', olsmodel5.rsquared) #printing R-squared for model
print('Adjusted R-squared', olsmodel5.rsquared_adj) #printing Adjusted R-squared for model
R-squared 0.8449210795185332
Adjusted R-squared 0.8419111089100322

Observations:¶

screen_size has the next highest VIF, and dropping it does not have much of an effect on R-squared and Adjusted R-squared.

In [57]:
#removing the screen_size column from original x_train
x_train = x_train.drop(['screen_size'], axis=1)
#removing screen_size column from x_test as well
x_test = x_test.drop(['screen_size'], axis=1)
#building a new model called olsmodel6 for the modified x_train with y_train
olsmodel6 = sm.OLS(y_train, x_train).fit()
print(olsmodel4.summary()) #printing the summary of olsmodel4
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.845
Model:                               OLS   Adj. R-squared:                  0.842
Method:                    Least Squares   F-statistic:                     280.7
Date:                   Thu, 16 Nov 2023   Prob (F-statistic):               0.00
Time:                           21:37:11   Log-Likelihood:                 124.13
No. Observations:                   2417   AIC:                            -154.3
Df Residuals:                       2370   BIC:                             117.9
Df Model:                             46                                         
Covariance Type:               nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                   -46.4460      9.185     -5.057      0.000     -64.458     -28.434
screen_size               0.0244      0.003      7.227      0.000       0.018       0.031
main_camera_mp            0.0208      0.002     13.855      0.000       0.018       0.024
selfie_camera_mp          0.0135      0.001     12.047      0.000       0.011       0.016
int_memory                0.0001   6.97e-05      1.661      0.097   -2.09e-05       0.000
ram                       0.0232      0.005      4.534      0.000       0.013       0.033
battery               -1.685e-05   7.27e-06     -2.319      0.020   -3.11e-05    -2.6e-06
weight                    0.0010      0.000      7.496      0.000       0.001       0.001
release_year              0.0236      0.005      5.193      0.000       0.015       0.033
days_used              4.203e-05   3.08e-05      1.363      0.173   -1.84e-05       0.000
normalized_new_price      0.4309      0.012     35.255      0.000       0.407       0.455
brand_name_Alcatel        0.0227      0.028      0.797      0.426      -0.033       0.078
brand_name_Asus           0.0223      0.029      0.775      0.439      -0.034       0.079
brand_name_BlackBerry    -0.0224      0.058     -0.386      0.700      -0.137       0.092
brand_name_Celkon        -0.0390      0.054     -0.724      0.469      -0.145       0.067
brand_name_Coolpad        0.0282      0.062      0.457      0.647      -0.093       0.149
brand_name_Gionee         0.0520      0.043      1.206      0.228      -0.033       0.137
brand_name_Google        -0.0254      0.075     -0.337      0.736      -0.173       0.122
brand_name_HTC           -0.0057      0.029     -0.200      0.842      -0.062       0.051
brand_name_Honor          0.0389      0.030      1.278      0.202      -0.021       0.099
brand_name_Huawei         0.0051      0.022      0.230      0.818      -0.039       0.049
brand_name_Infinix        0.0875      0.085      1.033      0.302      -0.079       0.254
brand_name_Karbonn        0.1015      0.055      1.833      0.067      -0.007       0.210
brand_name_LG            -0.0059      0.024     -0.248      0.804      -0.053       0.041
brand_name_Lava           0.0405      0.049      0.821      0.412      -0.056       0.137
brand_name_Lenovo         0.0526      0.024      2.180      0.029       0.005       0.100
brand_name_Meizu         -0.0056      0.040     -0.139      0.889      -0.085       0.074
brand_name_Micromax      -0.0264      0.029     -0.915      0.360      -0.083       0.030
brand_name_Microsoft      0.1021      0.079      1.294      0.196      -0.053       0.257
brand_name_Motorola      -0.0039      0.031     -0.128      0.898      -0.064       0.056
brand_name_Nokia          0.0779      0.034      2.301      0.021       0.011       0.144
brand_name_OnePlus        0.0780      0.067      1.162      0.245      -0.054       0.210
brand_name_Oppo           0.0197      0.028      0.702      0.483      -0.035       0.075
brand_name_Panasonic      0.0635      0.040      1.570      0.117      -0.016       0.143
brand_name_Realme         0.0392      0.048      0.823      0.410      -0.054       0.133
brand_name_Samsung       -0.0241      0.020     -1.220      0.222      -0.063       0.015
brand_name_Sony          -0.0543      0.032     -1.677      0.094      -0.118       0.009
brand_name_Spice         -0.0076      0.051     -0.149      0.881      -0.107       0.092
brand_name_Vivo          -0.0082      0.029     -0.280      0.779      -0.065       0.049
brand_name_XOLO           0.0223      0.040      0.564      0.573      -0.055       0.100
brand_name_Xiaomi         0.0941      0.029      3.293      0.001       0.038       0.150
brand_name_ZTE            0.0016      0.027      0.057      0.955      -0.052       0.055
os_Others                -0.0516      0.032     -1.635      0.102      -0.114       0.010
os_Windows               -0.0204      0.045     -0.453      0.650      -0.109       0.068
os_iOS                   -0.0628      0.046     -1.358      0.175      -0.154       0.028
4g_yes                    0.0529      0.016      3.337      0.001       0.022       0.084
5g_yes                   -0.0722      0.031     -2.302      0.021      -0.134      -0.011
==============================================================================
Omnibus:                      223.194   Durbin-Watson:                   1.911
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              422.387
Skew:                          -0.618   Prob(JB):                     1.90e-92
Kurtosis:                       4.633   Cond. No.                     7.69e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.69e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [58]:
#rechecking the variation inflation factor for the columns in x_train
#creating a dataframe called vif 
#one column called Variables holds the columns in x.train
#one column called Values holds the variation inflation factor for the corresponding columns
#using for loop to calculate vif with variation_inflation_factor() function for each column in x_train
vif=pd.DataFrame()
vif['Variables'] = x_train.columns
vif['Values'] = [variance_inflation_factor(x_train.values, i) 
             for i in range(len(x_train.columns))] 
                    
vif #displaying vif dataframe
Out[58]:
Variables Values
0 const 3652189.313723
1 main_camera_mp 2.280685
2 selfie_camera_mp 2.785809
3 int_memory 1.361090
4 ram 2.243688
5 battery 3.832812
6 weight 2.988550
7 release_year 4.714847
8 days_used 2.647808
9 normalized_new_price 3.054145
10 brand_name_Alcatel 1.202265
11 brand_name_Asus 1.198891
12 brand_name_BlackBerry 1.123884
13 brand_name_Celkon 1.172478
14 brand_name_Coolpad 1.052543
15 brand_name_Gionee 1.090087
16 brand_name_Google 1.044376
17 brand_name_HTC 1.220847
18 brand_name_Honor 1.265332
19 brand_name_Huawei 1.490147
20 brand_name_Infinix 1.061741
21 brand_name_Karbonn 1.068827
22 brand_name_LG 1.344073
23 brand_name_Lava 1.068851
24 brand_name_Lenovo 1.294022
25 brand_name_Meizu 1.134233
26 brand_name_Micromax 1.225923
27 brand_name_Microsoft 1.494091
28 brand_name_Motorola 1.256747
29 brand_name_Nokia 1.484503
30 brand_name_OnePlus 1.078405
31 brand_name_Oppo 1.368931
32 brand_name_Panasonic 1.104229
33 brand_name_Realme 1.167275
34 brand_name_Samsung 1.574092
35 brand_name_Sony 1.211511
36 brand_name_Spice 1.080580
37 brand_name_Vivo 1.316824
38 brand_name_XOLO 1.114832
39 brand_name_Xiaomi 1.312174
40 brand_name_ZTE 1.262301
41 os_Others 1.517099
42 os_Windows 1.592227
43 os_iOS 1.175537
44 4g_yes 2.461295
45 5g_yes 1.796667

Observations:¶

All of the variables have a VIF that is less than 5 now.

In [59]:
#Viewing the variables with high p-values because they do not have a significant effect on prediction
#Identifying variables that have p-value greater than 5
olsmodel6.pvalues[olsmodel6.pvalues > .05]
Out[59]:
int_memory              0.177078
battery                 0.580329
days_used               0.378412
brand_name_Alcatel      0.108414
brand_name_Asus         0.320032
brand_name_BlackBerry   0.755715
brand_name_Celkon       0.678724
brand_name_Coolpad      0.498662
brand_name_Gionee       0.320038
brand_name_Google       0.676405
brand_name_HTC          0.893951
brand_name_Huawei       0.404902
brand_name_Infinix      0.269453
brand_name_LG           0.848910
brand_name_Lava         0.257059
brand_name_Meizu        0.996814
brand_name_Micromax     0.527715
brand_name_Microsoft    0.188100
brand_name_Motorola     0.877901
brand_name_OnePlus      0.158124
brand_name_Oppo         0.330456
brand_name_Panasonic    0.060981
brand_name_Realme       0.375372
brand_name_Samsung      0.357625
brand_name_Sony         0.129866
brand_name_Spice        0.814073
brand_name_Vivo         0.802564
brand_name_XOLO         0.323316
brand_name_ZTE          0.815072
os_Windows              0.696827
os_iOS                  0.117809
dtype: float64
In [60]:
#dropping all the variables that have p-value greater than 5 from x-train
x_train = x_train.drop(['int_memory', 'battery', 'days_used','os_Windows','os_iOS','brand_name_Alcatel', 'brand_name_Asus',
                       'brand_name_BlackBerry', 'brand_name_Celkon', 'brand_name_Coolpad',
                       'brand_name_Gionee', 'brand_name_Google', 'brand_name_HTC',
                       'brand_name_Huawei', 'brand_name_Infinix','brand_name_LG','brand_name_Lava',
                       'brand_name_Meizu', 'brand_name_Micromax', 'brand_name_Microsoft','brand_name_Motorola',
                       'brand_name_OnePlus','brand_name_Oppo','brand_name_Panasonic','brand_name_Realme','brand_name_Samsung',
                       'brand_name_Sony','brand_name_Spice','brand_name_Vivo', 'brand_name_XOLO','brand_name_ZTE'], axis=1)

#dropping those columns from x_test as well
x_test = x_test.drop(['int_memory', 'battery', 'days_used','os_Windows','os_iOS','brand_name_Alcatel', 'brand_name_Asus',
                       'brand_name_BlackBerry', 'brand_name_Celkon', 'brand_name_Coolpad',
                       'brand_name_Gionee', 'brand_name_Google', 'brand_name_HTC',
                       'brand_name_Huawei', 'brand_name_Infinix','brand_name_LG','brand_name_Lava',
                       'brand_name_Meizu', 'brand_name_Micromax', 'brand_name_Microsoft','brand_name_Motorola',
                       'brand_name_OnePlus','brand_name_Oppo','brand_name_Panasonic','brand_name_Realme','brand_name_Samsung',
                       'brand_name_Sony','brand_name_Spice','brand_name_Vivo', 'brand_name_XOLO','brand_name_ZTE'], axis=1)

#building a new model called olsmodel7 for the modified x_train with y_train
olsmodel7 = sm.OLS(y_train, x_train).fit()
print(olsmodel7.summary()) #printing the summary of olsmodel7
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.840
Model:                               OLS   Adj. R-squared:                  0.839
Method:                    Least Squares   F-statistic:                     897.6
Date:                   Thu, 16 Nov 2023   Prob (F-statistic):               0.00
Time:                           21:37:11   Log-Likelihood:                 82.834
No. Observations:                   2417   AIC:                            -135.7
Df Residuals:                       2402   BIC:                            -48.81
Df Model:                             14                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  -55.6229      6.880     -8.085      0.000     -69.114     -42.132
main_camera_mp           0.0212      0.001     15.112      0.000       0.018       0.024
selfie_camera_mp         0.0142      0.001     13.339      0.000       0.012       0.016
ram                      0.0230      0.005      4.578      0.000       0.013       0.033
weight                   0.0016   6.04e-05     27.250      0.000       0.002       0.002
release_year             0.0283      0.003      8.287      0.000       0.022       0.035
normalized_new_price     0.4351      0.011     40.043      0.000       0.414       0.456
brand_name_Honor         0.0526      0.028      1.897      0.058      -0.002       0.107
brand_name_Karbonn       0.1230      0.055      2.248      0.025       0.016       0.230
brand_name_Lenovo        0.0554      0.022      2.557      0.011       0.013       0.098
brand_name_Nokia         0.0626      0.031      2.032      0.042       0.002       0.123
brand_name_Xiaomi        0.0930      0.026      3.615      0.000       0.043       0.143
os_Others               -0.1429      0.028     -5.047      0.000      -0.198      -0.087
4g_yes                   0.0459      0.015      3.053      0.002       0.016       0.075
5g_yes                  -0.0673      0.031     -2.202      0.028      -0.127      -0.007
==============================================================================
Omnibus:                      242.954   Durbin-Watson:                   1.915
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              470.533
Skew:                          -0.656   Prob(JB):                    6.68e-103
Kurtosis:                       4.718   Cond. No.                     2.92e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.92e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [61]:
#brand_name_Honor has a p-value greater than 5
#dropping brand_name_Honor from x_train
x_train = x_train.drop(['brand_name_Honor'], axis=1)
#removing brand_name_Honor column from x_test as well
x_test = x_test.drop(['brand_name_Honor'], axis=1)

#building a new model called olsmodel8 for the modified x_train with y_train
olsmodel8 = sm.OLS(y_train, x_train).fit()
print(olsmodel8.summary()) #printing the summary of olsmodel8
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.839
Model:                               OLS   Adj. R-squared:                  0.838
Method:                    Least Squares   F-statistic:                     965.3
Date:                   Thu, 16 Nov 2023   Prob (F-statistic):               0.00
Time:                           21:37:11   Log-Likelihood:                 81.025
No. Observations:                   2417   AIC:                            -134.0
Df Residuals:                       2403   BIC:                            -52.99
Df Model:                             13                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  -56.5564      6.866     -8.237      0.000     -70.020     -43.093
main_camera_mp           0.0213      0.001     15.241      0.000       0.019       0.024
selfie_camera_mp         0.0143      0.001     13.404      0.000       0.012       0.016
ram                      0.0230      0.005      4.571      0.000       0.013       0.033
weight                   0.0016   6.04e-05     27.262      0.000       0.002       0.002
release_year             0.0287      0.003      8.441      0.000       0.022       0.035
normalized_new_price     0.4338      0.011     39.980      0.000       0.413       0.455
brand_name_Karbonn       0.1223      0.055      2.235      0.026       0.015       0.230
brand_name_Lenovo        0.0537      0.022      2.479      0.013       0.011       0.096
brand_name_Nokia         0.0609      0.031      1.977      0.048       0.000       0.121
brand_name_Xiaomi        0.0895      0.026      3.488      0.000       0.039       0.140
os_Others               -0.1437      0.028     -5.077      0.000      -0.199      -0.088
4g_yes                   0.0458      0.015      3.046      0.002       0.016       0.075
5g_yes                  -0.0653      0.031     -2.136      0.033      -0.125      -0.005
==============================================================================
Omnibus:                      241.256   Durbin-Watson:                   1.914
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              464.306
Skew:                          -0.654   Prob(JB):                    1.50e-101
Kurtosis:                       4.703   Cond. No.                     2.91e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.91e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Observations:¶

All of the p-values are below 0.05 after dropping the high p-value columns. The R-squared and Adjusted R-squared have decreased to some extent, but they are still at a good number around 0.839 and 0.838.

Linearity and Independence¶

In [62]:
#creating a plot with Residuals vs. Fitted Values
#making a dataframe with the Residuals and Fitted Values
lin_ind = pd.DataFrame()

lin_ind['Residuals'] = olsmodel8.resid #adding a column called Residuals with residuals from olsmodel8
lin_ind['Fitted'] = olsmodel8.fittedvalues #adding a column called Fitted with fitted values from olsmodel8

#creating residual plot using data from lin_ind, placing fitted on x axis and residuals on y-axis
#lowess is set to True to display curved line through the points to better understand the shape
#color is set to steel blue
sns.residplot(data=lin_ind, x='Fitted', y='Residuals', color='steelblue', lowess=True)
plt.xlabel('Fitted Values') #setting title of x-axis
plt.ylabel('Residuals') #setting title of y-axis
plt.title('Residuals vs. Fitted Values') #setting title of residual plot
plt.show(); #displaying residual plot

Observations:¶

  • There is no pattern between the Fitted Values and Residuals.
  • The model meets the assumptions of Linearity and Independence.

Normality¶

In [63]:
#creating a histogram for Residuals to view their distribution
#kde=True used to view density curve
sns.histplot(data=lin_ind, x='Residuals', kde=True) #using data from lin_ind, plotting Residuals on x-axis
plt.title('Distribution of Residuals') #setting title of histogram
plt.xlabel('Residuals') #setting title of x-axis
plt.show(); #displaying histogram

Observations:¶

Although the distribution is very slightly skewed to the left, it seems to follow a normal distribution due to the visible bell-curve.

In [64]:
#creating a Q-Q plot of residuals
#importing scipy.stats to create probability plot
import scipy.stats as stats

#creating probability plot using data from Residuals and distribution is set to norm
#plot is set to plt because mathplotlib.pyplot is being used to plot graph points
stats.probplot(lin_ind['Residuals'], dist='norm', plot=plt)
plt.show(); #displaying probability plot

Observations:¶

Although the ends are slightly different, the majority of the data appears to form a straight line.

Shapiro-Wilk Test: Are the residuals normally distributed?¶
  • $H_0$ (Null Hypothesis): The residuals are normally distributed.
  • $H_a$ (Alternate Hypothesis): The residuals are not normally distributed.
In [65]:
#using Shapiro-Wilk test to check if Residuals are normally distributed
stats.shapiro(lin_ind['Residuals'])
Out[65]:
ShapiroResult(statistic=0.9682212471961975, pvalue=1.0571945927045805e-22)

Observations:¶

  • The p-value for the Shapiro-Wilk test is small enough to state that the residuals are not normally distributed.
  • The histogram of the distribution of residuals supports a normal distribution.
  • The Q-Q plot also supports a normal distribution.
  • Since the histogram and Q-Q plot support a normal distribution, it can be assumed that the model meets the assumption for normality.

Homoscedasticity¶

Goldfeld-Quandt Test: Do the residuals have homoscedasticity?¶
  • $H_0$ (Null Hypothesis): The residuals are homoscedastic or symmetrically distributed.
  • $H_a$ (Alternate Hypothesis): The residuals are heteroscedastic or non-symmetrically distributed.
In [66]:
#using the goldfeldquandt test to examine homoscedasticity in Residuals
#importing statsmodels.stats.api and statsmodels.compat to carry out the goldfeldquandt test
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

name=['F statistic', 'p-value'] #labeling the F statistic and p-value
test = sms.het_goldfeldquandt(lin_ind['Residuals'],x_train) #carrying out the test
lzip(name,test) #to display both F-statistic and p-value
Out[66]:
[('F statistic', 1.0181008983065767), ('p-value', 0.37830265087249293)]

Observations:¶

  • The null hypothesis cannot be rejected because the p-value is large.
  • The residuals are homoscedastic or symmetrically distributed.
  • The model meets the assumption for homoscedasticity.

Final Model¶

In [67]:
#displaying the final model again
olsmodel8 = sm.OLS(y_train, x_train).fit()
print(olsmodel8.summary()) #printing the summary of olsmodel8
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     normalized_used_price   R-squared:                       0.839
Model:                               OLS   Adj. R-squared:                  0.838
Method:                    Least Squares   F-statistic:                     965.3
Date:                   Thu, 16 Nov 2023   Prob (F-statistic):               0.00
Time:                           21:37:14   Log-Likelihood:                 81.025
No. Observations:                   2417   AIC:                            -134.0
Df Residuals:                       2403   BIC:                            -52.99
Df Model:                             13                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  -56.5564      6.866     -8.237      0.000     -70.020     -43.093
main_camera_mp           0.0213      0.001     15.241      0.000       0.019       0.024
selfie_camera_mp         0.0143      0.001     13.404      0.000       0.012       0.016
ram                      0.0230      0.005      4.571      0.000       0.013       0.033
weight                   0.0016   6.04e-05     27.262      0.000       0.002       0.002
release_year             0.0287      0.003      8.441      0.000       0.022       0.035
normalized_new_price     0.4338      0.011     39.980      0.000       0.413       0.455
brand_name_Karbonn       0.1223      0.055      2.235      0.026       0.015       0.230
brand_name_Lenovo        0.0537      0.022      2.479      0.013       0.011       0.096
brand_name_Nokia         0.0609      0.031      1.977      0.048       0.000       0.121
brand_name_Xiaomi        0.0895      0.026      3.488      0.000       0.039       0.140
os_Others               -0.1437      0.028     -5.077      0.000      -0.199      -0.088
4g_yes                   0.0458      0.015      3.046      0.002       0.016       0.075
5g_yes                  -0.0653      0.031     -2.136      0.033      -0.125      -0.005
==============================================================================
Omnibus:                      241.256   Durbin-Watson:                   1.914
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              464.306
Skew:                          -0.654   Prob(JB):                    1.50e-101
Kurtosis:                       4.703   Cond. No.                     2.91e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.91e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Displaying Performance Metrics for Final Model¶

In [68]:
#assigning the R-squared of the final model to r_squared variable
final_r_squared = olsmodel8.rsquared
#assigning the Adjusted R-squared of the final model to adjusted_r_squared variable
finaladj_r_squared = olsmodel8.rsquared_adj
#assigning Mean Absolute Error to final_mae_training variable for training set
final_mae_training = mean_absolute_error(y_train, olsmodel8.predict(x_train))
#assigning Mean Absolute Error to final_mae_testing variable for testing set
final_mae_testing = mean_absolute_error(y_test, olsmodel8.predict(x_test))
#assigning Mean Square Error to final_mse_training variable for training set
final_mse_training = mean_squared_error(y_train, olsmodel8.predict(x_train))
#assinging Mean Square Error to final_mse_testing variable for testing set
final_mse_testing = mean_squared_error(y_test, olsmodel8.predict(x_test))
#assigning Root Mean Square Error to final_rmse_training variable for training set
final_rmse_training = np.sqrt(final_mse_training)
#assigning Root Mean Square Error to final_rmse_testing variable for testing set
final_rmse_testing = np.sqrt(final_mse_testing)

#creating a dataframe with all of the above performance metrics
#creating small lists of performance metric labels and the corresponding variables from above to make one whole list called performance1
performance1 = [['R-squared', final_r_squared], ['Adjusted R-squared', finaladj_r_squared], ['Training MAE', final_mae_training],
               ['Testing MAE', final_mae_testing],['Training MSE', final_mse_training],['Testing MSE', final_mse_testing]
                ,['Training RMSE', final_rmse_training], ['Testing RMSE', final_rmse_testing]]
#creating a dataframe from the performance list and adding titles to the columns
final_model_performance = pd.DataFrame(performance1, columns = ['Performance Metric', 'Value'])
final_model_performance #dataframe with the metrics and corresponding values
Out[68]:
Performance Metric Value
0 R-squared 0.839290
1 Adjusted R-squared 0.838421
2 Training MAE 0.182984
3 Testing MAE 0.187646
4 Training MSE 0.054753
5 Testing MSE 0.058312
6 Training RMSE 0.233994
7 Testing RMSE 0.241480

Observations:¶

  • The performance metrics remain almost the same from the very first model that was built. This means that the model is consistent in performance until the final model.
  • 83.9% of the variance in the training set was measured by the model.
  • The model is not an underfit because the Adjusted R-squared is 0.838, and R-squared is 0.839.
  • The model is not an overfit because the Mean Absolute Error, Mean Squared Error and Root Mean Squared Error are similar for the training and testing sets.
  • There is a 0.18 error in the model's ability to predict normalized_used_device prices on the test set.
  • Overall, this model is a good fit.

Final Model Equation¶

In [69]:
all_variables = len(x_train.columns) #assigning the length of the columns in x_train to a variable called all_variables
print('normalized_used_price =') #first printing the 'normalized_used_price = ' 

for i in range(all_variables): #for loop for every column in the length of x_train
    if i == 0: #if the column coefficient is equal to 0
        print(olsmodel8.params[i], '+') #print the coefficient and +
    elif i < (all_variables-1): #else if the coefficient is not the last variable
        print(olsmodel8.params[i],'*',x_train.columns[i], '+') #print coefficient*column and +
    else: #else (last column)
        print(olsmodel8.params[i],'*',x_train.columns[i]) #print coefficient*column
normalized_used_price =
-56.55642408210383 +
0.021335092174868883 * main_camera_mp +
0.014296909931711427 * selfie_camera_mp +
0.02295780622642629 * ram +
0.0016475080269465867 * weight +
0.028734621986767902 * release_year +
0.43383415709499906 * normalized_new_price +
0.12233049116897698 * brand_name_Karbonn +
0.05366715043017031 * brand_name_Lenovo +
0.060899305072687704 * brand_name_Nokia +
0.08952854580976108 * brand_name_Xiaomi +
-0.1437416519758557 * os_Others +
0.045806756390413136 * 4g_yes +
-0.06531508684395132 * 5g_yes

Observations:¶

normalized_used_price = -56.55642408210383 + 0.021335092174868883 main_camera_mp + 0.014296909931711427 selfie_camera_mp + 0.02295780622642629 ram + 0.0016475080269465867 weight + 0.028734621986767902 release_year + 0.43383415709499906 normalized_new_price + 0.12233049116897698 brand_name_Karbonn + 0.05366715043017031 brand_name_Lenovo + 0.060899305072687704 brand_name_Nokia + 0.08952854580976108 brand_name_Xiaomi + -0.1437416519758557 os_Others + 0.045806756390413136 4g_yes + -0.06531508684395132 * 5g_yes

Actionable Insights and Recommendations¶

When all of the other variables are constant:

  • A one unit increase in the main camera megapixels will increase the price of the used device by 0.0213 units.
  • A one unit increase in the selfie camera megapixels will increase the price of the used device by about 0.0143 units.
  • A one unit increase in the RAM will increase the price of the used device by about 0.0230 units.
  • A one unit increase in the weight will increase the price of the used device by 0.0016 units.
  • A one unit increase in the release year will increase the price of the used device by 0.0287 units.
  • A one unit increase in the new price of a device will increase the price of the used device by 0.4338 units.

The brands Karbonn, Lenovo, Nokia, and Xiaomi all have a positive effect on the price of used devices.

  • Having the brand Karbonn can increase the price of used devices by 0.1223 units.
  • Having the brand Lenovo can increase the price of used devices by about 0.0537 units.
  • Having the brand Nokia can increase the price of used devices by about 0.0609 units.
  • Having the brand Xiaomi can increase the price of used devices by 0.0895 units.

An operating system other than Android, iOS, and Windows can decrease the price of used devices by 0.1437 units. The category others can be very broad. Gathering more data on used devices with particular operating systems and analyzing their effect can be helpful in understanding whether they really create a negative impact on the price of a used device, when compared to Android, iOS, and Windows.

Having 5G seems to decrease the used device price by 0.0653 units, which is interesting.

Qualities of used devices that allow them to be sold for a higher price and can possibly help increase profit:

  • Used devices that have a larger camera size and more RAM can have higher used device prices.
  • Used devices with recent release year and more weight can also have higher used device prices.
  • The brands Karbonn, Lenovo, Nokia, and Xiaomi can have higher used device prices.

More research should be gathered on the effects of 5G and 4G on used device prices and why 5G seems to have a negative impact on the prices.